The data used as an example throughout this report is from Family A in the Northland region. Family A has data recorded from 2013 to 2020, but only the data from 2013 to 2018 is used, so that the predicted number of oranges for 2019 can be compared against the true number of oranges. The purpose of this is to examine the reliability of the proposed time series models.
Aim: to forecast the number of oranges for 2020.
Modelling Procedure
Step 1. Plot the data and identify any unusual observations.
Figure 1. Monthly Data Time Series Plot
Comment: The data shows no particular trend but strong cyclic behavior, and there appears to be an unusual pattern between 2015 and 2016. In addition, the mean is not constant, i.e. it changes over time, implying non-stationarity. Hence it is sensible to remove the data from 2013 up to 2015 to avoid invalid statistical results.
Figure 2. Monthly Data Time Series Plot
Figure 3. New Time Series Monthly Data
Figure 4. New Time Series Monthly Data vs. Seasonally adjusted
Figure 5. STL decomposition of additive components
The decomposition of the Family A data uses the STL method. STL is an acronym for “Seasonal and Trend decomposition using Loess”, where Loess is a method for estimating non-linear relationships. STL only provides facilities for additive decompositions.
Figure 6. ACF plot of the Family A data
In the above ACF plot, the dashed yellow lines indicate whether the autocorrelations are (statistically) significantly different from zero at the 95% confidence level. Here the autocorrelations are significantly different from 0, indicating high autocorrelation.
Step 2. Split the data into training and test sets.
The accuracy of forecasts can only be determined by considering how well a model performs on new data that were not used when fitting the model. When choosing models, it is common practice to separate the available data into two portions, training and test sets, where the training data is used to fit a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.
# The data is split into a training set (Jan 2016 - Dec 2018) and a test set (2019).
training <- window(new.tsA, start = c(2016,1), end = c(2018,12))
test <- window(new.tsA, start = 2019)
Step 3. Transform the data from the training set.
The Box-Cox transformation is a family of transformations, including both logarithms and power transformations, used here to transform the Family A data. This step is done manually because the Hyndman-Khandakar algorithm behind the auto.arima() function only takes care of steps 3-5, so steps 1, 2, 6 and 7 must still be carried out by hand to ensure the residuals are roughly homoscedastic. One important feature of the power transformation is the \(\lambda\) parameter, where \(\lambda = 0\) is equivalent to a log-transformation. A good value of \(\lambda\) is one which makes the size of the seasonal variation about the same across the whole series, as that makes the forecasting model simpler. The BoxCox.lambda() function can be used to choose \(\lambda\) automatically instead of doing it manually.
lambda <- BoxCox.lambda(training)
trans.A <- BoxCox(training, lambda)
lambda
## [1] 1.462446
The optimal value of \(\lambda\) is 1.462446. Note: if a transformation is required, \(\lambda\) must be estimated AFTER the data has been split into training and test sets in order to avoid data leakage; that is, only the training data is used to choose the transformation, not the test data.
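To make the transformation concrete, here is a minimal base-R sketch of the Box-Cox transform and its inverse, using the \((y^\lambda - 1)/\lambda\) parameterisation; box_cox and inv_box_cox are illustrative names, not functions from the forecast package:

```r
# Box-Cox transform: log(y) when lambda = 0, else (y^lambda - 1) / lambda
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1) / lambda
}

# Inverse (back-)transform, mapping forecasts back to the original scale
inv_box_cox <- function(w, lambda) {
  if (lambda == 0) exp(w) else (lambda * w + 1)^(1 / lambda)
}

y <- c(10, 20, 30)
w <- box_cox(y, 1.462446)   # transform with the estimated lambda
inv_box_cox(w, 1.462446)    # recovers the original 10, 20, 30
```

Applying the inverse to the transformed series returns the original values, which is exactly what forecast() must do to report forecasts on the original scale.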
Step 4. Fit the models.
In this step, two methods were considered to fit and predict the data. The methods used are ARIMA (auto.arima()) and ETS (ets()) models.
The auto.arima() function in R uses an algorithm which combines unit root tests, minimisation of the AICc and maximum likelihood estimation to obtain an ARIMA model. The arguments stepwise = FALSE and approximation = FALSE are included in the auto.arima() call below to ensure that ALL fitted seasonal ARIMA models are considered. The ARIMA model overall has the form \[ARIMA(p,d,q)(P,D,Q)[m]\], where p is the order of the autoregressive part, d is the order of differencing and q is the order of the moving average part; (P,D,Q) have the same meaning but are defined in the context of seasonality, with m the seasonal period. The final chosen model has the form \(ARIMA(1,0,0)\) with non-zero mean.
auto.arima.model <- auto.arima(training, stepwise = FALSE, approximation = FALSE, lambda = lambda)
auto.arima.model
## Series: training
## ARIMA(1,0,0) with non-zero mean
## Box Cox transformation: lambda= 1.462446
##
## Coefficients:
## ar1 mean
## -0.3918 91.9818
## s.e. 0.1579 7.4633
##
## sigma^2 estimated as 4048: log likelihood=-199.64
## AIC=405.29 AICc=406.04 BIC=410.04
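Written out, the selected model is an AR(1) with a non-zero mean fitted on the Box-Cox transformed scale. With \(w_t\) denoting the transformed series and the estimates taken from the output above, the fitted equation is:

\[
w_t - 91.9818 = -0.3918\,(w_{t-1} - 91.9818) + \varepsilon_t, \qquad \varepsilon_t \sim \text{WN}(0,\; \hat\sigma^2 = 4048)
\]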
Figure 7. Residuals of ARIMA(1,0,0) model
##
## Box-Ljung test
##
## data: resid(auto.arima.model)
## X-squared = 23.61, df = 24, p-value = 0.484
The ACF plot of the residuals from the chosen ARIMA model shows that all autocorrelations are within the threshold limits indicating that the residuals are behaving like white noise. The histogram suggests that the residuals may not be normal - the right tail seems a little too long for a normal distribution. The Box-Ljung test returns a large p-value (p-value = 0.484), also suggesting that the residuals resemble white noise.
In the case of ETS (Error, Trend, Seasonal) models, the ets() function can be used to fit these types of model. The notation for each component is defined as Error = {A,M}, Trend = {N,A,Ad} and Seasonal = {N,A,M}, where A stands for additive and M stands for multiplicative.
ets.model <- ets(training, lambda = lambda)
ets.model
## ETS(A,N,N)
##
## Call:
## ets(y = training, lambda = lambda)
##
## Box-Cox transformation: lambda= 1.4624
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 92.0804
##
## sigma: 68.9559
##
## AIC AICc BIC
## 437.7586 438.5086 442.5091
The best ETS model selected is the \(ETS(A,N,N)\), as shown in the result above. Formally, this model is also known as simple exponential smoothing with additive errors. In comparison to the \(ARIMA(1,0,0)\) model, the ETS model has a higher AIC value of 437.7586 (the ARIMA model’s AIC value is 405.29).
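For reference, an \(ETS(A,N,N)\) model is simple exponential smoothing written in state-space form, with level \(\ell_t\) and smoothing parameter \(\alpha\):

\[
y_t = \ell_{t-1} + \varepsilon_t, \qquad \ell_t = \ell_{t-1} + \alpha\,\varepsilon_t
\]

Since the estimated \(\alpha\) is essentially zero (1e-04), the level barely updates from one observation to the next, which is why the ETS point forecasts reported later are almost constant.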
Figure 8. Residuals of ETS(A,N,N) model
##
## Box-Ljung test
##
## data: resid(ets.model)
## X-squared = 44.069, df = 24, p-value = 0.007491
The ACF plot of the residuals from the chosen ETS model, on the other hand, shows that not all autocorrelations are within the threshold limits, indicating that the residuals are not behaving like white noise. The histogram suggests that the residuals don’t follow a normal distribution. The Box-Ljung test returns a small p-value (p-value = 0.007491), again suggesting that the residuals do not resemble white noise.
Step 5. Forecast the data using the fitted models.
Only one thing is true about forecasts - they are always wrong.
The forecast() function can be used to obtain forecasts from an ARIMA or an ETS model. Having transformed the data, it is necessary to reverse the transformation (back-transform) when forecasting to obtain forecasts on the original scale. This can be done in R by adding the lambda argument (equal to the \(\lambda\) value selected for transforming the data) to the forecast() function. In addition, biasadj = TRUE returns the mean of the forecast distribution, with a bias adjustment applied to the back-transformation, whereas biasadj = FALSE (the default) returns the median of the forecast distribution, which is not the mean unless the distribution is symmetric.
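For completeness, the bias-adjusted back-transformation can be sketched as follows. This is the standard Box-Cox bias adjustment (stated here from general knowledge, not taken from the report): for \(\lambda \neq 0\), with \(w_h\) the point forecast and \(\sigma_h^2\) the forecast variance on the transformed scale,

\[
\hat{y}_h = \left(\lambda w_h + 1\right)^{1/\lambda} \left[\, 1 + \frac{\sigma_h^2\,(1-\lambda)}{2\left(\lambda w_h + 1\right)^{2}} \right]
\]

With biasadj = FALSE, only the first factor is used, which corresponds to the median of the forecast distribution.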
Figure 9. Forecasts for the seasonally adjusted Family A data
Figure 10. Forecasts for the seasonally adjusted Family A data
While linear exponential smoothing models are all special cases of ARIMA models, the non-linear exponential smoothing models have no equivalent ARIMA counterparts. On the other hand, there are also many ARIMA models that have no exponential smoothing counterparts. In particular, all ETS models are non-stationary, while some ARIMA models are stationary.
a1 <- auto.arima.model %>% forecast(h = 12) %>%
accuracy(test)
a1[,c("RMSE", "MAE", "MAPE", "MASE")]
## RMSE MAE MAPE MASE
## Training set 14.22280 12.26324 264.0349 0.7546609
## Test set 18.73575 16.27520 Inf 1.0015507
a2 <- ets.model %>% forecast(h = 12) %>%
accuracy(test)
a2[,c("RMSE", "MAE", "MAPE", "MASE")]
## RMSE MAE MAPE MASE
## Training set 15.42341 13.41716 305.1204 0.8256711
## Test set 18.11619 16.11958 Inf 0.9919743
The R output shows the forecasting performance of the two competing models over the test data. In this case the ETS model seems to be the slightly more accurate model based on the test-set RMSE, MAE and MASE, though the ARIMA model fits the training data slightly better than the ETS model. (MAPE is infinite for both models because the test data contains a zero observation.) The table below shows that the ARIMA point forecasts slightly over-estimate the real data on average, whereas the ETS point forecasts are the same value for every month.
| 2019 | True Data | Predicted ARIMA | Predicted ETS |
|---|---|---|---|
| Jan | 0 | 33 | 27 |
| Feb | 19 | 24 | 27 |
| Mar | 47 | 28 | 27 |
| Apr | 2 | 27 | 27 |
| May | 46 | 27 | 27 |
| Jun | 15 | 27 | 27 |
| Jul | 49 | 27 | 27 |
| Aug | 3 | 27 | 27 |
| Sep | 33 | 27 | 27 |
| Oct | 22 | 27 | 27 |
| Nov | 47 | 27 | 27 |
| Dec | 25 | 27 | 27 |
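As a sanity check, the test-set error measures can be recomputed in base R from the table above. The point forecasts in the table are rounded to whole numbers, so the results differ slightly from the accuracy() output:

```r
true_2019 <- c(0, 19, 47, 2, 46, 15, 49, 3, 33, 22, 47, 25)     # observed 2019 values
arima_fc  <- c(33, 24, 28, 27, 27, 27, 27, 27, 27, 27, 27, 27)  # rounded ARIMA forecasts

e <- true_2019 - arima_fc
rmse <- sqrt(mean(e^2))   # root mean squared error
mae  <- mean(abs(e))      # mean absolute error
round(c(RMSE = rmse, MAE = mae), 2)  # close to the accuracy() values of 18.74 and 16.28
```

The small gap comes entirely from the rounding of the forecasts. MAPE involves dividing by the true values, so the zero observation in January 2019 is what produces the Inf entries in the accuracy() output.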
The analysis of the Family A data without transformation (skipping Step 3) is summarized in this section.
Step 4. Fit the models.
The best model, as shown in the output below, is the ARIMA(1,0,0) with non-zero mean, with AIC = 298.27.
## Series: training
## ARIMA(1,0,0) with non-zero mean
##
## Coefficients:
## ar1 mean
## -0.4077 26.5683
## s.e. 0.1560 1.6691
##
## sigma^2 estimated as 207: log likelihood=-146.14
## AIC=298.27 AICc=299.02 BIC=303.02
Figure 11. Residuals of ARIMA(1,0,0) model
##
## Box-Ljung test
##
## data: resid(auto.arima.model2)
## X-squared = 23.029, df = 24, p-value = 0.5181
The ACF plot of the residuals from the chosen ARIMA model shows that all autocorrelations are within the threshold limits indicating that the residuals are behaving like white noise. The histogram suggests that the residuals appear to roughly follow a normal distribution. The Box-Ljung test returns a large p-value (p-value = 0.5181), also suggesting that the residuals resemble white noise.
The best ETS model selected is the \(ETS(M,N,N)\), as shown in the result below. Formally, this model is simple exponential smoothing with multiplicative errors. In comparison to the \(ARIMA(1,0,0)\) model, the ETS model has a higher AIC value of 331.2937 (the ARIMA model’s AIC value is 298.27).
ets.model2 <- ets(training)
ets.model2
## ETS(M,N,N)
##
## Call:
## ets(y = training)
##
## Smoothing parameters:
## alpha = 1e-04
##
## Initial states:
## l = 26.5861
##
## sigma: 0.5912
##
## AIC AICc BIC
## 331.2937 332.0437 336.0442
Figure 12. Residuals of ETS(M,N,N) model
##
## Box-Ljung test
##
## data: resid(ets.model2)
## X-squared = 48.239, df = 24, p-value = 0.002358
The ACF plot of the residuals from the chosen ETS model shows that not all autocorrelations are within the threshold limits, indicating that the residuals aren’t behaving like white noise. The histogram also suggests that the residuals may not be normal. The Box-Ljung test returns a very small p-value (p-value = 0.002358), also suggesting that the residuals do not resemble white noise.
Step 5. Forecast the data using the fitted models.
Figure 13. Forecasts for the seasonally adjusted Family A data
Figure 14. Forecasts for the seasonally adjusted Family A data
| 2019 | True Data | Predicted ARIMA | Predicted ETS |
|---|---|---|---|
| Jan | 0 | 34 | 27 |
| Feb | 19 | 24 | 27 |
| Mar | 47 | 28 | 27 |
| Apr | 2 | 26 | 27 |
| May | 46 | 27 | 27 |
| Jun | 15 | 27 | 27 |
| Jul | 49 | 27 | 27 |
| Aug | 3 | 27 | 27 |
| Sep | 33 | 27 | 27 |
| Oct | 22 | 27 | 27 |
| Nov | 47 | 27 | 27 |
| Dec | 25 | 27 | 27 |
Overall, both models fitted to the transformed data have higher AIC values than their counterparts fitted to the non-transformed data, although AIC values are not directly comparable across different transformations of the response. The forecasts from the non-transformed data are similar to those from the transformed data for both models. This could be because \(\lambda = 1.46\) is fairly close to \(\lambda = 1\), for which \(Y^{\lambda} = Y^{1} = Y\), so the transformation changes the data relatively little.
| Types of Data | ARIMA | ETS | AIC for ARIMA | AIC for ETS |
|---|---|---|---|---|
| Transformed data | ARIMA(1,0,0) | ETS(A,N,N) | 405.29 | 437.7586 |
| Non-transformed data | ARIMA(1,0,0) | ETS(M,N,N) | 298.27 | 331.2937 |
Possible extensions are to obtain potential predictors that might be correlated with the monthly number of oranges from external sources:
- Decompose annual GDP by region into monthly data.
- Add dummy seasonal (monthly) predictors (see https://stackoverflow.com/questions/59672182/prediction-interval-level-on-forecast-autoplot).
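A minimal base-R sketch of the seasonal-dummy idea (the forecast package also provides seasonaldummy() for this; month and X below are illustrative names):

```r
# One factor level per calendar month; three years of monthly training data
month <- factor(rep(month.abb, times = 3), levels = month.abb)

# model.matrix() drops the first level (Jan) as the baseline, leaving
# 11 indicator columns that could enter a regression as seasonal dummies
X <- model.matrix(~ month)[, -1]
dim(X)  # 36 observations by 11 dummy columns
```

Each row has at most a single 1, marking which month (other than the January baseline) that observation falls in.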